Abstract. We use the method of Maximum (relative) Entropy to process information in the form of
observed data and moment constraints. The generic "canonical" form of the posterior distribution 
for the problem of simultaneous updating with data and moments is obtained. We discuss the 
general problem of non-commuting constraints, when they should be processed sequentially and 
when simultaneously. As an illustration, the multinomial example of die tosses is solved in detail 
for two superficially similar but actually very different problems. 
INTRODUCTION

The original method of Maximum Entropy, MaxEnt [1], was designed to assign proba- 
bilities on the basis of information in the form of constraints. It gradually evolved into a 
more general method, the method of Maximum relative Entropy (abbreviated ME) [2]- 
[6], which allows one to update probabilities from arbitrary priors unlike the original 
MaxEnt which is restricted to updates from a uniform background measure. 

The realization [5] that ME includes not just MaxEnt but also Bayes' rule as special 
cases is highly significant. First, it implies that ME is capable of reproducing every as- 
pect of orthodox Bayesian inference and proves the complete compatibility of Bayesian 
and entropy methods. Second, it opens the door to tackling problems that could not be 
addressed by either the MaxEnt or orthodox Bayesian methods individually. The main 
goal of this paper is to explore this latter possibility: the problem of processing data plus 
additional information in the form of expected values.

When using Bayes' rule it is quite common to impose constraints on the prior distri- 
bution. In some cases these constraints are also satisfied by the posterior distribution, but 
these are special cases. In general, constraints imposed on priors do not "propagate" to 
the posteriors. Although Bayes' rule can handle some constraints, we seek a procedure 
capable of enforcing any constraint on the posterior distributions. 

After a brief review of how ME processes data and reproduces Bayes' rule, we de- 
rive our main result, the general "canonical" form of the posterior distribution for the 
problem of simultaneous updating with data and moment constraints. The final result 
is deceptively simple: Bayes' rule is modified by a "canonical" exponential factor. Although
this result is very simple, it should be handled with caution: once we consider
several sources of information such as multiple constraints we must confront the prob- 
lem of non-commuting constraints. We discuss the question of whether they should be 
processed simultaneously, or sequentially, and in what order. Our general conclusion is 
that these different alternatives correspond to different states of information and accord- 
ingly we expect that they will lead to different inferences. 

As an illustration, the multinomial example of die tosses is solved in some detail for 
two problems. They appear superficially similar but are in fact very different. The first 
die problem requires that the constraints be processed sequentially. This corresponds 
to the familiar situation of using MaxEnt to derive a prior and then using Bayes to 
process data. The second die problem, which requires that the constraints be processed 
simultaneously, provides a clear example that lies beyond the reach of Bayes' rule. 



UPDATING WITH DATA USING THE ME METHOD 

Our first concern when using the ME method to update from a prior to a posterior 
distribution is to define the space in which the search for the posterior will be conducted. 
We wish to infer something about the value of a quantity $\theta \in \Theta$ on the basis of three
pieces of information: prior information about $\theta$ (the prior), the known relationship
between $x$ and $\theta$ (the model), and the observed values of the data $x \in \mathcal{X}$. Since we
are concerned with both $x$ and $\theta$, the relevant space is neither $\mathcal{X}$ nor $\Theta$ but the product
$\mathcal{X} \times \Theta$ and our attention must be focused on the joint distribution $P(x,\theta)$. The selected
joint posterior $P_{\mathrm{new}}(x,\theta)$ is that which maximizes the entropy,

$$S[P, P_{\mathrm{old}}] = -\int dx\, d\theta\; P(x,\theta) \log \frac{P(x,\theta)}{P_{\mathrm{old}}(x,\theta)} , \tag{1}$$

subject to the appropriate constraints. All prior information is codified into the joint prior
$P_{\mathrm{old}}(x,\theta) = P_{\mathrm{old}}(\theta) P_{\mathrm{old}}(x|\theta)$. Both $P_{\mathrm{old}}(\theta)$ (the familiar Bayesian prior distribution)
and $P_{\mathrm{old}}(x|\theta)$ (the likelihood) contain prior information. 6 The new information is the
observed data $x'$, which in the ME framework must be expressed in the form of a
constraint on the allowed posteriors. The family of posteriors $P(x,\theta)$ that reflects the
fact that $x$ is now known to be $x'$ is such that

$$P(x) = \int d\theta\, P(x,\theta) = \delta(x - x') . \tag{2}$$

This amounts to an infinite number of constraints on $P(x,\theta)$: for each value of $x$ there is
one constraint and one Lagrange multiplier $\lambda(x)$.



5 We use the concise notation $\theta$ and $x$ to represent one or many unknown variables, $\theta = (\theta_1, \theta_2, \ldots)$ and
6 The notion that the likelihood function contains prior information may sound unfamiliar from the point
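Maximizing $S$, (1), subject to the constraints (2) plus normalization,

$$\delta \left\{ S + \alpha \left[ \int dx\, d\theta\, P(x,\theta) - 1 \right] + \int dx\, \lambda(x) \left[ \int d\theta\, P(x,\theta) - \delta(x - x') \right] \right\} = 0 , \tag{3}$$

yields the joint posterior,

$$P_{\mathrm{new}}(x,\theta) = P_{\mathrm{old}}(x,\theta) \frac{e^{\lambda(x)}}{z} , \tag{4}$$

where $z$ is a normalization constant, and $\lambda(x)$ is determined from (2),

$$\int d\theta\, P_{\mathrm{old}}(x,\theta) \frac{e^{\lambda(x)}}{z} = P_{\mathrm{old}}(x) \frac{e^{\lambda(x)}}{z} = \delta(x - x') . \tag{5}$$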

The final expression for the joint posterior is 

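$$P_{\mathrm{new}}(x,\theta) = P_{\mathrm{old}}(x,\theta) \frac{\delta(x - x')}{P_{\mathrm{old}}(x)} = \delta(x - x') P_{\mathrm{old}}(\theta|x) , \tag{6}$$

and the marginal posterior distribution for $\theta$ is

$$P_{\mathrm{new}}(\theta) = \int dx\, P_{\mathrm{new}}(x,\theta) = P_{\mathrm{old}}(\theta|x') , \tag{7}$$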

which is the familiar Bayes' conditionalization rule. 

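To summarize: $P_{\mathrm{old}}(x,\theta) = P_{\mathrm{old}}(x) P_{\mathrm{old}}(\theta|x)$ is updated to $P_{\mathrm{new}}(x,\theta) = P_{\mathrm{new}}(x) P_{\mathrm{new}}(\theta|x)$
with $P_{\mathrm{new}}(x) = \delta(x - x')$ fixed by the observed data while $P_{\mathrm{new}}(\theta|x) = P_{\mathrm{old}}(\theta|x)$ remains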
unchanged. We see that in accordance with the minimal updating philosophy that drives 
the ME method one only updates those aspects of one's beliefs for which corrective new 
evidence (in this case, the data) has been supplied. 
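
As a numerical check of (4)-(7), the following minimal sketch applies the ME data update to a discretized coin-toss model and compares the resulting marginal with Bayes' rule computed directly. The binomial likelihood, the grid over $\theta$, the data (7 heads in 10 tosses) and all function names are illustrative assumptions rather than part of the text; on a grid the Dirac $\delta$ becomes a Kronecker $\delta$.

```python
from math import comb

def joint_prior(thetas, prior, n):
    """Table P_old(x, theta) = P_old(theta) * Binomial(x | n, theta); rows are x = 0..n."""
    return [[p * comb(n, x) * t**x * (1.0 - t)**(n - x)
             for t, p in zip(thetas, prior)]
            for x in range(n + 1)]

def me_update_with_data(joint, x_obs):
    """Eqs. (4)-(6): multiply row x by e^{lambda(x)}/z = delta(x, x_obs) / P_old(x)."""
    updated = []
    for x, row in enumerate(joint):
        p_x = sum(row)                                  # marginal P_old(x)
        factor = (1.0 if x == x_obs else 0.0) / p_x
        updated.append([factor * v for v in row])
    return updated

def marginal_theta(joint):
    """Eq. (7): sum the joint table over x."""
    return [sum(row[j] for row in joint) for j in range(len(joint[0]))]

def bayes_posterior(thetas, prior, n, x_obs):
    """Bayes' rule computed directly on the same grid, for comparison."""
    u = [p * t**x_obs * (1.0 - t)**(n - x_obs) for t, p in zip(thetas, prior)]
    s = sum(u)
    return [v / s for v in u]

# Hypothetical example: flat prior on a 20-point grid, 7 heads in 10 tosses.
thetas = [(j + 0.5) / 20 for j in range(20)]
prior = [1.0 / 20] * 20
me_post = marginal_theta(me_update_with_data(joint_prior(thetas, prior, 10), 7))
bayes_post = bayes_posterior(thetas, prior, 10, 7)
print(max(abs(a - b) for a, b in zip(me_post, bayes_post)))
```

The two posteriors agree up to rounding, as (7) requires: only the row $x = x'$ of the joint table survives, and it is rescaled by $1/P_{\mathrm{old}}(x')$.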



SIMULTANEOUS UPDATING WITH MOMENTS AND DATA 

Here we generalize the previous section to include additional information about $\theta$ in the
form of a constraint on the expected value of some function $f(\theta)$,

$$\int dx\, d\theta\, P(x,\theta) f(\theta) = \langle f(\theta) \rangle = F . \tag{8}$$

We emphasize that constraints imposed at the level of the prior need not be satisfied by 
the posterior. What we do here differs from the standard Bayesian practice in that we 
require the constraint to be satisfied by the posterior distribution. 

Maximizing the entropy (1) subject to normalization, the data constraint (2), and the 
moment constraint (8) yields the joint posterior, 

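$$P_{\mathrm{new}}(x,\theta) = P_{\mathrm{old}}(x,\theta) \frac{e^{\lambda(x) + \beta f(\theta)}}{z} , \tag{9}$$

where $z$ is a normalization constant,

$$z = \int dx\, d\theta\, e^{\lambda(x) + \beta f(\theta)} P_{\mathrm{old}}(x,\theta) . \tag{10}$$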



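The Lagrange multipliers $\lambda(x)$ are determined from the data constraint, (2),

$$\frac{e^{\lambda(x)}}{z} = \frac{\delta(x - x')}{Z P_{\mathrm{old}}(x')} \quad \text{where} \quad Z(\beta, x') = \int d\theta\, e^{\beta f(\theta)} P_{\mathrm{old}}(\theta|x') , \tag{11}$$

so that the joint posterior becomes

$$P_{\mathrm{new}}(x,\theta) = \delta(x - x') P_{\mathrm{old}}(\theta|x') \frac{e^{\beta f(\theta)}}{Z} . \tag{12}$$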

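The remaining Lagrange multiplier $\beta$ is determined by imposing that the posterior
$P_{\mathrm{new}}(x,\theta)$ satisfy (8). This yields an implicit equation for $\beta$,

$$\frac{\partial \log Z}{\partial \beta} = F . \tag{13}$$

Note that since $Z = Z(\beta, x')$ the resultant $\beta$ will depend on the observed data $x'$. Finally,
the new marginal distribution for $\theta$ is

$$P_{\mathrm{new}}(\theta) = P_{\mathrm{old}}(\theta|x') \frac{e^{\beta f(\theta)}}{Z} = P_{\mathrm{old}}(\theta) P_{\mathrm{old}}(x'|\theta) \frac{e^{\beta f(\theta)}}{Z P_{\mathrm{old}}(x')} . \tag{14}$$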

For $\beta = 0$ (no moment constraint) we recover Bayes' rule. For $\beta \neq 0$ Bayes' rule is
modified by a "canonical" exponential factor.
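
The implicit equation (13) is easy to solve numerically once $\theta$ is discretized. The sketch below is one possible way to do it, assuming a one-dimensional grid and plain bisection (which works because $\partial \log Z / \partial \beta = \langle f \rangle$ is nondecreasing in $\beta$). The coin-toss posterior, the choice $f(\theta) = \theta$, the target $F = 0.5$ and the function names are hypothetical choices made only for illustration.

```python
from math import exp

def tilt(weights, fvals, beta):
    """Eq. (14) on a grid: P_old(theta|x') * exp(beta f(theta)) / Z."""
    shift = max(beta * f for f in fvals)       # keeps exp() from overflowing
    u = [w * exp(beta * f - shift) for w, f in zip(weights, fvals)]
    z = sum(u)
    return [v / z for v in u]

def mean_f(weights, fvals, beta):
    """<f> under the tilted distribution, i.e. d log Z / d beta."""
    return sum(p * f for p, f in zip(tilt(weights, fvals, beta), fvals))

def solve_beta(weights, fvals, F, lo=-200.0, hi=200.0, iters=200):
    """Bisection on eq. (13); assumes the root lies inside [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_f(weights, fvals, mid) < F:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical example: Bayes posterior for 7 heads in 10 tosses (flat prior),
# followed by the added posterior constraint <theta> = 0.5.
thetas = [(j + 0.5) / 50 for j in range(50)]
u = [t**7 * (1.0 - t)**3 for t in thetas]
s = sum(u)
bayes = [v / s for v in u]
beta = solve_beta(bayes, thetas, 0.5)
posterior = tilt(bayes, thetas, beta)
print(beta, mean_f(bayes, thetas, beta))
```

Setting `beta = 0` in `tilt` returns the Bayes posterior unchanged. In this example the data favor $\theta > 0.5$, so the multiplier that enforces $\langle \theta \rangle = 0.5$ comes out negative.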



COMMUTING AND NON-COMMUTING CONSTRAINTS 

The ME method allows one to process information in the form of constraints. When we 
are confronted with several constraints we must be particularly cautious. In what order 
should they be processed? Or should they be processed at the same time? The answer 
depends on the nature of the constraints and the question being asked. 

We refer to constraints as commuting when it makes no difference whether they are 
handled simultaneously or sequentially. The most common example is that of Bayesian 
updating on the basis of data collected in multiple experiments: for the purpose of 
inferring $\theta$ it is well-known that the order in which the observed data $x' = \{x'_1, x'_2, \ldots\}$
is processed does not matter. The proof that ME is completely compatible with Bayes'
rule implies that data constraints implemented through $\delta$ functions, as in (2), commute.
It is useful to see how this comes about. 

When an experiment is repeated it is common to refer to the value of x in the first 
experiment and the value of x in the second experiment. This is a dangerous practice 
because it obscures the fact that we are actually talking about two separate variables. 
We do not deal with a single $x$ but with a composite $x = (x_1, x_2)$ and the relevant
space is $\mathcal{X}_1 \times \mathcal{X}_2 \times \Theta$. After the first experiment yields the value $x'_1$, represented by
the constraint $C_1 : P(x_1) = \delta(x_1 - x'_1)$, we can perform a second experiment that yields
$x'_2$ and is represented by a second constraint $C_2 : P(x_2) = \delta(x_2 - x'_2)$. These constraints
$C_1$ and $C_2$ commute because they refer to different variables $x_1$ and $x_2$. An experiment,



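FIGURE 1. Illustrating the difference between processing two constraints $C_1$ and $C_2$ sequentially
($P_{\mathrm{old}} \to P_1 \to P_{\mathrm{new}}^{(a)}$) and simultaneously ($P_{\mathrm{old}} \to P_{\mathrm{new}}^{(b)}$ or $P_{\mathrm{old}} \to P_1 \to P_{\mathrm{new}}^{(b)}$).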



once performed and its outcome observed, cannot be un-performed and its result cannot 
be un-observed by a second experiment. Thus, imposing one constraint does not imply 
a revision of the other. 

In general constraints need not commute and when this is the case the order in which 
they are processed is critical. For example, suppose the prior is $P_{\mathrm{old}}$ and we receive
information in the form of a constraint, $C_1$. To update we maximize the entropy $S[P, P_{\mathrm{old}}]$
subject to $C_1$ leading to the posterior $P_1$ as shown in Figure 1. Next we receive a second
piece of information described by the constraint $C_2$. At this point we can proceed in
essentially two different ways:

(a) Sequential updating. Having processed $C_1$, we use $P_1$ as the current prior and
maximize $S[P, P_1]$ subject to the new constraint $C_2$. This leads us to the posterior $P_{\mathrm{new}}^{(a)}$.

(b) Simultaneous updating. Use the original prior $P_{\mathrm{old}}$ and maximize $S[P, P_{\mathrm{old}}]$ subject
to both constraints $C_1$ and $C_2$ simultaneously. This leads to the posterior $P_{\mathrm{new}}^{(b)}$. 7

To decide which path (a) or (b) is appropriate, we must be clear about how the ME 
method treats constraints. The ME machinery interprets a constraint such as $C_1$ in a
very mechanical way: all distributions satisfying $C_1$ are in principle allowed and all
distributions violating $C_1$ are ruled out.

Updating to a posterior $P_1$ consists precisely in revising those aspects of the prior
$P_{\mathrm{old}}$ that disagree with the new constraint $C_1$. However, there is nothing final about the
distribution $P_1$. It is just the best we can do in our current state of knowledge and we
fully expect that future information may require us to revise it further. Indeed, when
new information $C_2$ is received we must reconsider whether the original $C_1$ remains
valid or not. Are all distributions satisfying the new $C_2$ really allowed, even those that
violate $C_1$? If this is the case then the new $C_2$ takes over and we update from $P_1$ to $P_{\mathrm{new}}^{(a)}$.
The constraint $C_1$ may still retain some lingering effect on the posterior $P_{\mathrm{new}}^{(a)}$ through $P_1$,



7 At first sight it might appear that there exists a third possibility of simultaneous updating: (c) use $P_1$ as
the current prior and maximize $S[P, P_1]$ subject to both constraints $C_1$ and $C_2$ simultaneously. Fortunately,
and this is a valuable check for the consistency of the ME method, it is easy to show that case (c) is
equivalent to case (b). Whether we update from $P_{\mathrm{old}}$ or from $P_1$ the selected posterior is $P_{\mathrm{new}}^{(b)}$.



but in general $C_1$ has now become obsolete.

Alternatively, we may decide that the old constraint $C_1$ retains its validity. The new $C_2$
is not meant to revise $C_1$ but to provide an additional refinement of the family of allowed
posteriors. In this case the constraint that correctly reflects the new information is not $C_2$
but the more restrictive $C_1 \wedge C_2$. The two constraints should be processed simultaneously
to arrive at the correct posterior $P_{\mathrm{new}}^{(b)}$.

To summarize: sequential updating is appropriate when old constraints become obso- 
lete and are superseded by new information; simultaneous updating is appropriate when 
old constraints remain valid. The two cases refer to different states of information and 
therefore we expect that they will result in different inferences. These comments are 
meant to underscore the importance of understanding what information is being pro- 
cessed; failure to do so will lead to errors that do not reflect a shortcoming of the ME 
method but rather a misapplication of it. 
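
The distinction between paths (a) and (b) can be made concrete with a small numerical experiment. The sketch below is only an illustration under invented assumptions: a six-outcome space with a uniform prior, a mean constraint standing in for $C_1$ and a parity constraint standing in for $C_2$; none of these appear in the text. Each single-constraint update is an exponential tilt whose multiplier is found by bisection, and the simultaneous update uses a nested bisection over the two multipliers.

```python
from math import exp

def tilt(p, f, lam):
    """Return p_i * exp(lam f_i), renormalized."""
    u = [pi * exp(lam * fi) for pi, fi in zip(p, f)]
    z = sum(u)
    return [v / z for v in u]

def expect(p, f):
    return sum(pi * fi for pi, fi in zip(p, f))

def me_update(prior, f, F, lo=-50.0, hi=50.0):
    """Maximize S[P, prior] subject to <f> = F (one constraint)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if expect(tilt(prior, f, mid), f) < F:
            lo = mid
        else:
            hi = mid
    return tilt(prior, f, 0.5 * (lo + hi))

def me_update_both(prior, f1, F1, f2, F2, lo=-50.0, hi=50.0):
    """Impose both constraints at once: P ∝ prior * exp(l1 f1 + l2 f2).
    Outer bisection on l2; the inner me_update solves for l1 at each trial l2."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        p = me_update(tilt(prior, f2, mid), f1, F1)
        if expect(p, f2) < F2:
            lo = mid
        else:
            hi = mid
    return me_update(tilt(prior, f2, 0.5 * (lo + hi)), f1, F1)

# Hypothetical constraints on outcomes 1..6: C1 is <i> = 4.5, C2 is P(even) = 0.5.
prior = [1.0 / 6] * 6
f1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
f2 = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
p1 = me_update(prior, f1, 4.5)                      # P_old -> P_1
p_a = me_update(p1, f2, 0.5)                        # (a) sequential
p_b = me_update_both(prior, f1, 4.5, f2, 0.5)       # (b) simultaneous
p_c = me_update_both(p1, f1, 4.5, f2, 0.5)          # (c) both constraints, starting from P_1
print(expect(p_a, f1), expect(p_b, f1))
```

In this toy problem the sequential posterior satisfies the parity constraint but no longer has mean 4.5, whereas the simultaneous posterior satisfies both. Starting the simultaneous update from $P_1$ instead of the original prior gives the same distribution, in line with the equivalence of cases (b) and (c).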



SEQUENTIAL UPDATING: A LOADED DIE EXAMPLE 

This is a loaded die example illustrating the appropriateness of sequential updating. 
The background information is the following: A certain factory makes loaded dice. 
Unfortunately because of poor quality control, the dice are not identical and it is not 
known how each die is loaded. It is known, however, that the dice produced by this 
factory are such that face 2 is on the average twice as likely to come up as face number 
5. 

The mathematical representation of this situation is as follows. The fact that we deal 
with dice is modelled in terms of multinomial distributions. The probability that casting 
a $k$-sided die $n$ times yields $m_i$ instances for the $i$-th face is

$$P_{\mathrm{old}}(m|\theta) = P_{\mathrm{old}}(m_1 \ldots m_k | \theta_1 \ldots \theta_k, n) = \frac{n!}{m_1! \ldots m_k!} \theta_1^{m_1} \ldots \theta_k^{m_k} , \tag{15}$$

where $m = (m_1, \ldots, m_k)$ with $\sum_{i=1}^k m_i = n$, and $\theta = (\theta_1, \ldots, \theta_k)$ with $\sum_{i=1}^k \theta_i = 1$. The
generic problem is to infer the parameters $\theta$ on the basis of information about moments
of $\theta$ and data $m'$. The additional information about how the dice are loaded is represented
by the constraint $\langle \theta_2 \rangle = 2 \langle \theta_5 \rangle$. Note that this piece of information refers to the factory
as a whole and not to any individual die. The constraint is of the general form of (8)

$$C_1 : \langle f(\theta) \rangle = F \quad \text{where} \quad f(\theta) = \sum_{i=1}^k f_i \theta_i . \tag{16}$$

For this particular factory $F = 0$, and all $f_i = 0$ except for $f_2 = 1$ and $f_5 = -2$. Now that
the background information has been given, here is our first example. 

We purchase a die. On the basis of our general knowledge of dice we are led to write 
down a joint prior 

$$P_{\mathrm{old}}(m,\theta) = P_{\mathrm{old}}(\theta) P_{\mathrm{old}}(m|\theta) . \tag{17}$$

(The particular form of $P_{\mathrm{old}}(\theta)$ is not important for our current purpose so for the sake
of definiteness we can choose it flat.) At this point the only information we have is that
we have a die and it came from a factory described by $C_1$. Accordingly, we use ME to



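update to a new joint distribution. This is shown as $P_1$ in Figure 1. The relevant entropy
is

$$S[P, P_{\mathrm{old}}] = -\sum_m \int d\theta\; P(m,\theta) \log \frac{P(m,\theta)}{P_{\mathrm{old}}(m,\theta)} , \tag{18}$$

where

$$\sum_m = \sum_{m_1 \ldots m_k = 0}^{n} \delta\Big(\sum_{i=1}^k m_i - n\Big) \quad \text{and} \quad \int d\theta = \int d\theta_1 \ldots d\theta_k\, \delta\Big(\sum_{i=1}^k \theta_i - 1\Big) .$$

Maximizing $S$ subject to normalization and $C_1$ gives the $P_1$ posterior

$$P_1(m,\theta) = \frac{e^{\lambda f(\theta)}}{Z_1} P_{\mathrm{old}}(m,\theta) , \tag{19}$$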



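where the normalization constant $Z_1$ and the Lagrange multiplier $\lambda$ are determined from

$$Z_1 = \int d\theta\, e^{\lambda f(\theta)} P_{\mathrm{old}}(\theta) \quad \text{and} \quad \frac{\partial \log Z_1}{\partial \lambda} = F . \tag{20}$$

The joint distribution $P_1(m,\theta) = P_1(\theta) P_1(m|\theta)$ can be rewritten as

$$P_1(m,\theta) = P_1(\theta) P_{\mathrm{old}}(m|\theta) \quad \text{where} \quad P_1(\theta) = P_{\mathrm{old}}(\theta) \frac{e^{\lambda f(\theta)}}{Z_1} . \tag{21}$$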

To find out more about this particular die we toss it $n$ times and obtain data
$m' = (m'_1, \ldots, m'_k)$ which we represent as a new constraint

$$C_2 : P(m) = \delta(m - m') . \tag{22}$$

Our goal is to infer the $\theta$ that apply to our particular die. The original constraint $C_1$
applies to the whole factory while the new constraint $C_2$ refers to the actual die of
interest and thus takes precedence over $C_1$. As $n \to \infty$ we expect $C_1$ to become less
and less relevant. Therefore the two constraints should be processed sequentially.

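Using ME, that is (6), we impose $C_2$ and update from $P_1(m,\theta)$ to a new joint distribution
(shown as $P_{\mathrm{new}}^{(a)}$ in Figure 1)

$$P_{\mathrm{new}}^{(a)}(m,\theta) = \delta(m - m') P_1(\theta|m) . \tag{23}$$

Marginalizing over $m$ and using (21) the final posterior for $\theta$ is

$$P_{\mathrm{new}}^{(a)}(\theta) = P_1(\theta|m') = P_1(\theta) \frac{P_1(m'|\theta)}{P_1(m')} = \frac{1}{Z_2} e^{\lambda f(\theta)} P_{\mathrm{old}}(\theta) P_{\mathrm{old}}(m'|\theta) , \tag{24}$$

where

$$Z_2 = \int d\theta\, e^{\lambda f(\theta)} P_{\mathrm{old}}(\theta) P_{\mathrm{old}}(m'|\theta) . \tag{25}$$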

The readers will undoubtedly recognize that (24) is precisely the result obtained by 
using MaxEnt to obtain a prior, in this case $P_1(\theta)$ given in (21), and then using Bayes'
theorem to take the data into account. This familiar result has been derived in some 
detail for two reasons: first, to reassure the readers that ME does reproduce the standard 
solutions to standard problems and second, to establish a contrast with the example 
discussed next. 
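
The two steps of the sequential calculation can be sketched with simple Monte Carlo. This is a rough illustration under stated assumptions, not the method used to produce any result in the text: we take $k = 6$, a flat $P_{\mathrm{old}}(\theta)$ (so that $P_{\mathrm{old}}(\theta) P_{\mathrm{old}}(m'|\theta)$ is proportional to a Dirichlet density with parameters $m'_i + 1$), invented toss counts, and self-normalized reweighting of samples by $e^{\lambda f(\theta)}$. All function names are hypothetical.

```python
import random
from math import exp

def dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalizing independent gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def f(theta):
    """f(theta) = theta_2 - 2 theta_5 from eq. (16); faces are 1-indexed in the text."""
    return theta[1] - 2.0 * theta[4]

def tilted_mean(fvals, lam):
    """Self-normalized estimate of <f> under sample weights exp(lam f)."""
    shift = max(lam * v for v in fvals)
    w = [exp(lam * v - shift) for v in fvals]
    return sum(wi * v for wi, v in zip(w, fvals)) / sum(w)

def solve_multiplier(fvals, F, lo=-500.0, hi=500.0, iters=100):
    """Bisection for the multiplier that makes the tilted mean of f equal F."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if tilted_mean(fvals, mid) < F:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sequential_update(m_obs, n_samples=5000, seed=0):
    rng = random.Random(seed)
    k = len(m_obs)
    # Step 1, eqs. (19)-(21): lambda is fixed by the flat prior alone, before any data.
    prior_f = [f(dirichlet([1.0] * k, rng)) for _ in range(n_samples)]
    lam = solve_multiplier(prior_f, 0.0)
    # Step 2, eq. (24): Bayes with P_1 as prior; lambda is NOT re-solved.
    post_f = [f(dirichlet([m + 1.0 for m in m_obs], rng)) for _ in range(n_samples)]
    return lam, tilted_mean(post_f, lam)

# Invented data: face 2 came up 12 times, every other face twice.
lam, post_mean_f = sequential_update([2, 12, 2, 2, 2, 2])
print(lam, post_mean_f)
```

Because $\lambda$ is fixed before the data arrive, the posterior expectation of $f$ is generally not zero: for these invented counts the estimate is clearly positive, which reflects the fact that $P_{\mathrm{new}}^{(a)}$ is only required to satisfy $C_2$.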



SIMULTANEOUS UPDATING: A LOADED DIE EXAMPLE 



Here is a different problem illustrating the appropriateness of simultaneous updating. 
The background information is the same as in the previous example. The difference is 
that the factory now hires a quality control engineer who wants to learn as much as he can 
about the factory. His initial knowledge is described by the same prior $P_{\mathrm{old}}(m,\theta)$, (17).
After some inquiries he is told that the only available information is $C_1 : \langle \theta_2 \rangle = 2 \langle \theta_5 \rangle$.
Not satisfied with this limited information he decides to collect data that reflect the
production of the whole factory. Randomly chosen dice are tossed $n$ times yielding data
$m' = (m'_1, \ldots, m'_k)$ which is represented as a constraint,

$$C_2 : P(m) = \delta(m - m') . \tag{26}$$

The apparent resemblance with (22) may be misleading: (22) refers to a single die, while 
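(26) now refers to the whole factory. The goal here is to infer the distribution of $\theta$ that
describes the overall population of dice produced by the factory. The new constraint $C_2$
is information in addition to, rather than instead of, the old $C_1$: the two constraints should
be processed simultaneously. From (12) the joint posterior is 8

$$P_{\mathrm{new}}^{(b)}(m,\theta) = \delta(m - m') P_{\mathrm{old}}(\theta|m') \frac{e^{\beta f(\theta)}}{Z} . \tag{27}$$

Marginalizing over $m$ the posterior for $\theta$ is

$$P_{\mathrm{new}}^{(b)}(\theta) = P_{\mathrm{old}}(\theta|m') \frac{e^{\beta f(\theta)}}{Z} = \frac{e^{\beta f(\theta)}}{\zeta} P_{\mathrm{old}}(\theta) P_{\mathrm{old}}(m'|\theta) , \tag{28}$$

where the new normalization constant is

$$\zeta = \int d\theta\, e^{\beta f(\theta)} P_{\mathrm{old}}(\theta) P_{\mathrm{old}}(m'|\theta) \quad \text{and} \quad \frac{\partial \log \zeta}{\partial \beta} = F . \tag{29}$$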

This looks like the sequential case, (24), but there is a crucial difference: $\beta \neq \lambda$ and $\zeta \neq
Z_2$. In the sequential updating case, the multiplier $\lambda$ is chosen so that the intermediate $P_1$
satisfies $C_1$ while the posterior $P_{\mathrm{new}}^{(a)}$ only satisfies $C_2$. In the simultaneous updating case
the multiplier $\beta$ is chosen so that the posterior $P_{\mathrm{new}}^{(b)}$ satisfies both $C_1$ and $C_2$ or $C_1 \wedge C_2$.
Ultimately, the two distributions $P_{\mathrm{new}}(\theta)$ are different because they refer to different
problems: $P_{\mathrm{new}}^{(a)}(\theta)$ refers to a single die, while $P_{\mathrm{new}}^{(b)}(\theta)$ applies to all the dice produced
by the factory. 9



8 As mentioned in the previous footnote, whether we update from $P_{\mathrm{old}}$ or from $P_1$ we obtain the same
posterior $P_{\mathrm{new}}^{(b)}$.

9 For the sake of completeness, we note that, because of the peculiarities of $\delta$ functions, had the constraints
been processed sequentially but in the opposite order, first the data $C_2$, and then the moment $C_1$, the
resulting posterior would be the same as for simultaneous update to $P_{\mathrm{new}}^{(b)}$.
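
For the simultaneous problem the multiplier $\beta$ must be solved for with the data already in place, as in (28)-(29). The following sketch estimates $\beta$ and the posterior means $\langle \theta_i \rangle$ by reweighting samples. It assumes $k = 6$, a flat $P_{\mathrm{old}}(\theta)$ (so that $P_{\mathrm{old}}(\theta) P_{\mathrm{old}}(m'|\theta)$ is proportional to a Dirichlet density with parameters $m'_i + 1$), invented pooled counts, and hypothetical function names; the accuracy of such a self-normalized estimate depends on the sample size and is not assessed here.

```python
import random
from math import exp

def dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalizing independent gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def simultaneous_update(m_obs, fcoef, F, n_samples=5000, seed=0, lo=-500.0, hi=500.0):
    """Estimate beta and the means <theta_i> of eq. (28) by reweighting
    Dirichlet(m' + 1) samples with exp(beta f(theta)), f(theta) = sum_i f_i theta_i."""
    rng = random.Random(seed)
    samples = [dirichlet([m + 1.0 for m in m_obs], rng) for _ in range(n_samples)]
    fvals = [sum(c * t for c, t in zip(fcoef, th)) for th in samples]

    def weights(beta):
        shift = max(beta * v for v in fvals)          # avoids overflow in exp()
        w = [exp(beta * v - shift) for v in fvals]
        s = sum(w)
        return [x / s for x in w]

    for _ in range(100):                              # bisection on d log zeta / d beta = F
        mid = 0.5 * (lo + hi)
        if sum(w * v for w, v in zip(weights(mid), fvals)) < F:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    w = weights(beta)
    means = [sum(wi * th[i] for wi, th in zip(w, samples)) for i in range(len(m_obs))]
    return beta, means

fcoef = [0.0, 1.0, 0.0, 0.0, -2.0, 0.0]      # f_2 = 1, f_5 = -2, all other f_i = 0
beta, means = simultaneous_update([2, 12, 2, 2, 2, 2], fcoef, 0.0)
print(beta, means[1], 2.0 * means[4])
```

Here the estimated posterior means satisfy $\langle \theta_2 \rangle = 2 \langle \theta_5 \rangle$ by construction, and for these invented counts $\beta$ is negative because the data alone would push $\langle \theta_2 \rangle$ well above $2 \langle \theta_5 \rangle$.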



SUMMARY AND FINAL REMARKS 



The realization that the ME method incorporates Bayes' rule as a special case has 
allowed us to go beyond Bayes' rule to process both data and expected value constraints 
simultaneously. To put it bluntly, anything one can do with Bayes can also be done with 
ME with the additional ability to include information that was inaccessible to Bayes 
alone. This raises several questions and we have offered a few answers. 

First, it is not uncommon to claim that the non-commutability of constraints represents 
a problem for the ME method. Processing constraints in different orders might lead to 
different inferences and this is said to be unacceptable. We have argued that, on the 
contrary, the information conveyed by a particular sequence of constraints is not the 
same information conveyed by the same constraints in different order. Since different 
informational states should in general lead to different inferences, the way ME handles 
non-commuting constraints should not be regarded as a shortcoming but rather as a 
feature of the method. 

Second, we are capable of processing both data and moments. Is this kind of infor- 
mation of purely academic interest or is it something we might encounter in real life? 
At this early stage our answer must be tentative: we have given just one example - the 
die factory - which we think is fairly realistic. However, we feel that other applications 
(e.g. in econometrics and ecology) can be handled in this way as well. [7, 8] 

Finally, is it really true that this type of problem lies beyond the reach of Bayesian 
methods? After all, we can always interpret an expected value as a sample average in 
a sufficiently large number of trials. True. We can always construct a large imaginary 
ensemble of experiments. Entropy methods then become in principle superfluous; all 
we need is probability. The problem with inventing imaginary ensembles to do away 
with entropy in favor of mere probabilities, or to do away with probabilities in favor 
of more intuitive frequencies, is that the ensembles are just what they are claimed 
to be, imaginary. They are purely artificial constructions invented for the purpose of 
handling incomplete information. It seems to us that a safer way to proceed is to handle 
the available information directly as given (i.e., as expected values) without making 
additional assumptions about an imagined reality. 

Acknowledgements: We would like to acknowledge valuable discussions with C. Ca- 
APPENDIX: MORE ON THE MULTINOMIAL PROBLEM

Here we pursue the calculation of the posterior (28) in more detail. To be specific we 
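choose a flat prior, $P_{\mathrm{old}}(\theta) = \text{constant}$. Then, dropping the superscript (b),

$$P_{\mathrm{new}}(\theta) = \frac{1}{\zeta_e} \delta\Big(\sum_i \theta_i - 1\Big) \prod_{i=1}^k e^{\beta f_i \theta_i} \theta_i^{m'_i} , \tag{30}$$

where $\zeta_e$ differs from $\zeta$ in (29) only by a combinatorial coefficient,

$$\zeta_e = \int \delta\Big(\sum_i \theta_i - 1\Big) \prod_{i=1}^k d\theta_i\, e^{\beta f_i \theta_i} \theta_i^{m'_i} , \tag{31}$$

and $\beta$ is determined from (13) which in terms of $\zeta_e$ now reads $\partial \log \zeta_e / \partial \beta = F$. A brute
force calculation gives $\zeta_e$ as a nested hypergeometric series,

$$\zeta_e = e^{\beta f_k} I_1(I_2(\ldots(I_{k-1}))) , \tag{32}$$

where each $I$ is written as a sum of $\Gamma$ functions,

$$I_j = \Gamma(b_j - a_j) \sum_{q_j = 0}^{\infty} \frac{\Gamma(a_j + q_j)}{\Gamma(b_j + q_j)} \frac{t_j^{q_j}}{q_j!} I_{j+1} \quad \text{with} \quad I_k = 1 . \tag{33}$$

The index $j$ takes all values from 1 to $k-1$ and the other symbols are defined as follows:
$t_j = \beta (f_{k-j} - f_k)$, $a_j = m'_{k-j} + 1$, and

$$b_j = n + j + 1 + \sum_{i=0}^{j-1} q_i - \sum_{i=0}^{k-j-1} m'_i , \tag{34}$$

with $q_0 = m'_0 = 0$. The terms that have indices $\leq 0$ are equal to zero (i.e. $b_0 = q_0 = 0$,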
etc.). A few technical details are worth mentioning: First, one can have singular points 
when $t_j = 0$. In these cases the sum must be evaluated as the limit as $t_j \to 0$. Second,
since $a_j$ and $b_j$ are positive integers the gamma functions involve no singularities. Lastly,
the sums converge because $a_j < b_j$. The normalization for the first die example, (25), can
be calculated in a similar way. Currently, for small values of k (less than 10) it is feasible 
to evaluate the nested sums numerically; for larger values of k it is best to evaluate the 
integral for $\zeta_e$ using sampling methods. A more detailed version of the multinomial
example is worked out in [7]. 
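
For $k = 2$ the nested series collapses to a single sum with $a_1 = m'_1 + 1$, $b_1 = n + 2$ and $t_1 = \beta (f_1 - f_2)$, which makes it easy to compare against a direct numerical integration of $\zeta_e$ over $\theta_1$ with $\theta_2 = 1 - \theta_1$. The sketch below performs that comparison; the coefficients, counts, truncation order and midpoint rule are illustrative assumptions, and the function names are hypothetical.

```python
from math import exp, lgamma

def zeta_series(beta, f, m, terms=200):
    """k = 2 series: exp(beta f_2) Gamma(b-a) sum_q Gamma(a+q)/Gamma(b+q) t^q / q!."""
    a = m[0] + 1                        # a_1 = m'_1 + 1
    b = m[0] + m[1] + 2                 # b_1 = n + 2 with n = m'_1 + m'_2
    t = beta * (f[0] - f[1])            # t_1 = beta (f_1 - f_2)
    term = exp(lgamma(a) - lgamma(b))   # q = 0 term: Gamma(a) / Gamma(b)
    total = 0.0
    for q in range(terms):
        total += term
        term *= (a + q) / (b + q) * t / (q + 1)   # ratio of consecutive terms
    return exp(beta * f[1]) * exp(lgamma(b - a)) * total

def zeta_quadrature(beta, f, m, steps=20000):
    """Midpoint-rule evaluation of the k = 2 integral with theta_2 = 1 - theta_1."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        th = (i + 0.5) * h
        total += exp(beta * (f[0] * th + f[1] * (1.0 - th))) * th**m[0] * (1.0 - th)**m[1]
    return total * h

f = (1.0, -2.0)      # hypothetical coefficients f_1, f_2
m = (3, 2)           # hypothetical counts m'_1, m'_2
print(zeta_series(0.7, f, m), zeta_quadrature(0.7, f, m))
```

At $\beta = 0$ only the $q = 0$ term survives and the series reduces to the Beta function $\Gamma(m'_1 + 1)\Gamma(m'_2 + 1)/\Gamma(n + 2)$, which is a convenient sanity check on the indices.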



